Scientific learning is an iterative process employing Criticism and
Estimation. Correspondingly the formulated model factors into two complementary
parts: a predictive part allowing model criticism, and a Bayes posterior
part allowing estimation. Implications for significance tests, the theory of
precise measurement, and for ridge estimates are considered. Predictive checking
functions for transformation, serial correlation, bad values, and their
relation with Bayesian options are considered. Robustness is seen from a
Bayesian viewpoint and examples are given. For the bad value problem a
comparison with M estimators is made.
SIGNIFICANCE AND EXPLANATION


Scientific  method  is  a  process  of  guided  learning  in  which  accelerated 
acquisition  of  knowledge  relevant  to  some  question  under  investigation  is  achieved 
by  a  hierarchy  of  iterations  in  which  induction  and  deduction  are  used  in 
alternation. 

This process employs a developing model (or series of models implicit or
explicit) against which data can be viewed. At any given stage of the
investigation, the current model approximates relevant aspects of the studied
system and motivates the acquisition of further data as well as its analysis.
By the use of a prior distribution it is possible to represent some aspects of
such a model as if they were completely known and others as if they were more
or less unknown.

Now parsimony requires that, at any given stage, the model is no more complex
than is necessary to achieve a desirable degree of approximation, and since
each investigation is unique we cannot be sure in advance that any model we
postulate will meet this goal. Therefore, at the various points in our
investigation where data analysis is required, two types of inference are
involved: model criticism and parameter estimation. To effect the latter,
conditional on the plausibility of the model, and given the data, we can, using
Bayes' Theorem, deduce posterior distributions for unknown parameters and so
make inferences about them. But, before we can rely on such conditional
deduction, we ought logically to check whether the model postulated accords
with the data at all and, if not, consider how it should be modified. In
practice, this question is usually investigated by inspecting residuals, by
other informal techniques, and sometimes by making formal tests of goodness of
fit. In any case this inferential procedure of model criticism, whereby the
need for model modification is induced, is ultimately dependent on sampling
theory argument. These principles may be

No statistical model can safely be assumed adequate. Perspicacious criticism
employing diagnostic checks must therefore be applied. But while such checks
are always necessary, they may not be sufficient, because some discrepancies
may on the one hand be potentially disastrous and on the other be not easily
detectable. In addition therefore it is often pertinent to make the developing
model robust against contingencies to which it is currently judged sensitive.

The  object  of  this  paper  is  to  review  the  complementary  roles  in  the  model 
building  process  of  the  predictive  distribution  and  of  the  posterior  distribution; 
the  former  in  producing  diagnostic  checks  of  parametric  as  well  as  residual  features 
of  the  model,  the  latter  in  providing  a  general  basis  for  robust  estimation. 

1. Scientific learning and statistical inference.

Much of statistics is concerned with extending knowledge by building
empirico-mechanistic models that involve probability. A theory about such
scientific model building ought to explain what good statisticians and
scientists actually do. It seems that scientific knowledge advances by a
practice-theory iteration. Known facts (data) suggest a tentative theory or
model, implicit or explicit, which in turn suggests a particular examination
and analysis of data and/or the need to acquire further data; analysis may then
suggest a modified model that may require further practical illumination and so
on. I shall suppose that data are acquired from a designed experiment, but the
same argument would apply if data acquisition was from a sample survey or even
from a visit to the library. New knowledge thus evolves by an interplay between
dual processes of induction and deduction in which the model is not fixed but
is continually developing. The statistician's role is to assist this evolution
(see for example Box and Youle (1955), Box (1976)). In doing so he employs two
inferential devices: Criticism* and Estimation.

* The apt naming of inferential criticism is due to Cuthbert Daniel. See also
Popper (1959).
Suppose that at some stage i of an investigation, model $M_i$ is being
entertained.

Criticism can induce model modification. It involves a confrontation of $M_i$
with available data $y_d$ (old as well as newly acquired), and asks whether
$M_i$ is consonant with $y_d$ and, if not, how not. It employs a process of
diagnostic checking (see for example Box and Jenkins (1970)), which is often
done informally using plots of various kinds of residual quantities, or more
formally, with tests of goodness of fit or "tentative overfitting" procedures.
When a modification to $M_i$ has been made, this new model, in addition to
confronting the same data, will in some cases be checked against new data
generated by a new design. This design will be chosen to explore those shadowy
regions whose illumination is judged currently to be important in view of the
nature of the modified model and the needs of independent verification.

Estimation. When the iteration leads to a model worthy to be entertained it
may be used to estimate parameters conditional on its truth. In practice such
estimation is used not only at the termination of the model building sequence
but at many stages throughout it. This is because, to conduct criticism of a
model, it is often necessary to provisionally estimate parameters at
intermediate stages.

In any such iteration many subjective choices are made, conscious or
unconscious, good or bad. They determine for instance which plots, displays and
checks of data and residuals are looked at; and what treatments and variables
are included at which levels, over what experimental region, in which
transformation, in what design, to illuminate which models. The wisdom of these
choices over successive stages of development is the major determinant of how
fast the iteration will converge or of whether it converges at all, and
distinguishes good scientists and statisticians from bad. It is in this context
that theories of inference need be considered. While it is comforting to
remember that a good scientific iteration is likely to share the property of a
good numerical iteration, that mistakes often are self-correcting, this also
implies that the investigator must worry particularly about mistakes which are
likely not to be self-correcting.

1.1  Rival  theories  of  inference. 

The distinction between model criticism and parameter estimation has not always
been made and proponents both of sampling inference and Bayesian inference have
long sought for a single comprehensive theory.

I believe that, subject to some overlap discussed later, sampling theory is
needed for exploration and ultimate criticism of an entertained model in the
light of current data, while Bayes theory is needed for estimation of
parameters conditional on the adequacy of the entertained model. On this view
(see also Box and Tiao (1973)) both processes would have essential roles in the
continuing scientific iteration just as the two sexes are required for human
reproduction. Attempts to choose between two entities which were not
alternative but complementary could certainly be expected to lead to
contention, paradox, and confusion of the kind we have been experiencing. The
view that more than one mode of statistical reasoning can be useful is not, of
course, new and was advanced (however with a different emphasis and
conclusions) by R. A. Fisher. See also in particular Dempster (1971).

1.2  The  need  for  prior  distributions. 

In the past the need for probabilities expressing prior belief has often been
thought of, not as a necessity for all scientific inference, but rather as a
feature peculiar to Bayesian inference. This seems to come from the curious
idea that an outright assumption does not count as a prior belief. The
interconnection between model assumptions and prior distributions becomes clear
when it is remembered that every model can be imagined as embedded in a more
complex one. For example an outright assumption of normality can be modelled by
a suitable parametric family of distributions indexed by a parameter $\beta$,
which has a sharp prior at the normal value. I believe that it is impossible
logically to distinguish between model assumptions and the prior distribution
of the parameters. The model is the prior in the wide sense that it is a
probability statement of all the assumptions currently to be tentatively
entertained a priori. On this view, traditional sampling theory was of course
not free from assumptions of prior knowledge. Instead it was as if only two
states of mind had been allowed: complete certainty or complete uncertainty.

One illustration of how implied prior knowledge which is implausibly imprecise
can lead to trouble in sampling theory is the famous discovery by Stein (1956)
of the inadmissibility of the multivariate sample mean. Consider for example
the usual one-way analysis of variance set-up. The prior assumption which
justifies the shrinkage estimator (see, for example, Box and Tiao (1968a),
Lindley and Smith (1972)), that the group means $\mu_i$ are random samples from
some normal super-population having unknown mean and variance, might, in
appropriate circumstances, be eminently reasonable. It is easy, however, to
miss the lesson which is to be learned from such examples. Notice that there
are many circumstances in which this "Model II" assumption would not be
sensible either. For example, if the $\mu$'s were daily batch yields from some
production process, it might be much more reasonable to postulate a priori that
they followed some time series model such as a stationary autoregressive
process. The estimators (Tiao and Ali (1971)) then derived from Bayesian means
are not Stein's shrinkage estimators, but alternative estimators allowing
incorporation of relevant sample information about the autocorrelation of the
batch means. Thus while for this example, except as a numerical approximation,
we ought not to use the sample means as estimates, we ought not to use Stein's
shrinkage estimates either. There seems no logical way to avoid trouble except
by the explicit prior statement of the model we wish to entertain.

1.3 Two complementary factors from Bayes' formula.

If the prior probability distribution of parameters is accepted as essential,
then a complete statement of the entertained model at any stage of an
investigation is provided by the joint density for potential data $y$ and
parameters $\theta$ calculated from

$$p(y, \theta \mid A) = p(y \mid \theta, A)\, p(\theta \mid A). \qquad (1.1)$$

In these expressions $A$ is understood to indicate conditionality on all or
some of the assumptions in the model specification. This model (1.1) means to
me that current belief about the outcome of contemplated data acquisition would
be calibrated with adequate approximation by a physical simulation involving
random sampling from the distributions $p(y \mid \theta, A)$ and
$p(\theta \mid A)$.

The model can also be factored as

$$p(y, \theta \mid A) = p(\theta \mid y, A)\, p(y \mid A). \qquad (1.2)$$

In particular the second factor on the right, which can be computed before any
data become available,

$$p(y \mid A) = \int p(y \mid \theta, A)\, p(\theta \mid A)\, d\theta \qquad (1.3)$$

is the predictive distribution. It is the distribution of the totality of all
possible samples $y$ that could occur if the assumptions were true.

When an actual data vector $y_d$ becomes available

$$p(y_d, \theta \mid A) = p(\theta \mid y_d, A)\, p(y_d \mid A) \qquad (1.4)$$

and the first factor on the right is Bayes' posterior distribution of $\theta$
given $y_d$

$$p(\theta \mid y_d, A) \propto p(y_d \mid \theta, A)\, p(\theta \mid A). \qquad (1.5)$$

But of equal importance is the second factor

$$p(y_d \mid A) = \int p(y_d \mid \theta, A)\, p(\theta \mid A)\, d\theta, \qquad (1.6)$$

the predictive density associated with the particular data $y_d$ actually
obtained. Figure 1 illustrates for a single parameter $\theta$ and a sample
$y_d$ of $n = 2$ observations.

If the model is to be believed, the posterior distribution
$p(\theta \mid y_d, A)$ allows all relevant estimation inferences to be made
about $\theta$. However, if $y_d$ were such as would be very unlikely to be
generated by the model, this could not be shown by any abnormality in this
factor, but could be assessed by reference of the density $p(y_d \mid A)$ to
the predictive reference distribution $p(y \mid A)$, or of the density
$p\{g_i(y_d) \mid A\}$ of some relevant checking function $g_i(y_d)$ to its
predictive distribution. The importance of the predictive distribution and the
possibility of using it in some way as a model checking device has been
discussed by a number of authors. See in particular Roberts (1965), Guttman
(1967), Geisser (1971, 75), Geisser and Eddy (1979), Dempster (1971, 75) and
Kadane et al (1979). Also measures of surprise other than that discussed here
have been proposed, for example by Good (1956).
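
The reference of an observed checking function to its predictive distribution can be approximated by the kind of physical simulation described under (1.1). The following is a minimal sketch, not a procedure given in this paper: the function name `predictive_tail_prob` and its arguments are hypothetical, and it uses an upper tail area for a checking function whose large values indicate surprise, which only matches the density comparison when the predictive distribution of the check is unimodal. The illustrative model borrows the numbers of the normal example in Section 2.

```python
import random

def predictive_tail_prob(draw_theta, draw_data, check, y_obs, n_sims=20000, seed=0):
    """Monte Carlo estimate of Pr{check(y) >= check(y_obs)} under p(y | A)."""
    rng = random.Random(seed)
    g_obs = check(y_obs)
    exceed = 0
    for _ in range(n_sims):
        theta = draw_theta(rng)          # theta drawn from the prior p(theta | A)
        y_rep = draw_data(rng, theta)    # y drawn from p(y | theta, A)
        if check(y_rep) >= g_obs:
            exceed += 1
    return exceed / n_sims

# Illustrative (assumed) model: theta ~ N(70, 2), y_i ~ N(theta, 1), n = 4.
THETA0, VAR_THETA, VAR_Y, N = 70.0, 2.0, 1.0, 4
draw_theta = lambda rng: rng.gauss(THETA0, VAR_THETA ** 0.5)
draw_data = lambda rng, th: [rng.gauss(th, VAR_Y ** 0.5) for _ in range(N)]
check = lambda y: abs(sum(y) / len(y) - THETA0)   # distance of sample mean from prior mean

alpha = predictive_tail_prob(draw_theta, draw_data, check, [77, 74, 75, 78])
```

A small `alpha` says that samples as discrepant as the observed one would rarely be generated by the model; no alternative model has to be named to reach that conclusion.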

2.  Estimation  of  the  mean  of  a  normal  distribution. 

As an example consider a sample of $n$ observations drawn randomly from a
normal distribution with unknown mean $\theta$ and known variance $\sigma^2$,
with uncertainty about the mean expressed by supposing that, a priori, $\theta$
is distributed normally about $\theta_0$ with known variance $\sigma_\theta^2$.
Then conditional on the adequacy of the model, $\theta$ is estimated by
combining data and prior information in the normal posterior distribution

$$p(\theta \mid y, A) \propto (I_{\bar y} + I_\theta)^{1/2} \exp\{-\tfrac12 (I_{\bar y} + I_\theta)(\theta - \bar\theta)^2\} \qquad (2.1)$$

where $I_{\bar y} = n\sigma^{-2}$, $I_\theta = \sigma_\theta^{-2}$ and
$\bar\theta = w\bar y + (1 - w)\theta_0$ is an appropriately weighted average
of $\bar y$ and $\theta_0$, with $w = I_{\bar y}/(I_{\bar y} + I_\theta)$ the
proportion of information coming from the data.

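The predictive distribution allowing criticism of the model by contrasting data
and prior information is

$$p(y \mid A) \propto \sigma^{-(n-1)} \left(\frac{\sigma^2}{n} + \sigma_\theta^2\right)^{-1/2} \exp\left\{-\frac12\left[\frac{(n-1)s^2}{\sigma^2} + \frac{(\bar y - \theta_0)^2}{\sigma^2/n + \sigma_\theta^2}\right]\right\}. \qquad (2.2)$$

An overall predictive check is supplied by calculating

$$\alpha = \Pr\{p(y \mid A) < p(y_d \mid A)\} = \Pr\{\chi_n^2 > g(y_d)\} \qquad (2.3)$$

where

$$g(y_d) = \frac{(n-1)s_d^2}{\sigma^2} + \frac{(\bar y_d - \theta_0)^2}{\sigma^2/n + \sigma_\theta^2}. \qquad (2.4)$$
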


As an example suppose the sample consists of $n = 4$ analytical tests of yield
$y_d' = (77, 74, 75, 78)$ performed on a single batch from an industrial
process for which it is believed that the testing variance $\sigma^2 = 1$, the
process mean $\theta_0 = 70$ and the batch to batch variance is
$\sigma_\theta^2 = 2$. We wish to estimate the mean $\theta$ for this
particular batch.

In this example $\bar y_d = 76$, $\theta_0 = 70$, $s_d^2 = 3.33$,
$I_{\bar y} = 4$, $I_\theta = 0.5$, $w = .89$; so that, given the
appropriateness of the model previously discussed, $\theta$ is estimated by the
normal distribution $N(\bar\theta, \bar\sigma^2)$ with
$\bar\theta = (.89 \times 76) + (.11 \times 70) = 75.3$,
$\bar\sigma^2 = (4 + 0.5)^{-1} = 0.22$.
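
The arithmetic of this example follows directly from (2.1). As a minimal sketch (the function name `normal_posterior` and its argument names are illustrative, not taken from the paper), the information weights and the posterior mean and variance can be computed as:

```python
def normal_posterior(y, sigma2, theta0, sigma2_theta):
    """Posterior for a normal mean with known variance and a normal prior, as in (2.1)."""
    n = len(y)
    ybar = sum(y) / n
    info_data = n / sigma2              # I_ybar = n * sigma^-2
    info_prior = 1.0 / sigma2_theta     # I_theta = sigma_theta^-2
    w = info_data / (info_data + info_prior)     # share of information from the data
    post_mean = w * ybar + (1.0 - w) * theta0    # weighted average of ybar and theta0
    post_var = 1.0 / (info_data + info_prior)
    return w, post_mean, post_var

w, post_mean, post_var = normal_posterior([77, 74, 75, 78], 1.0, 70.0, 2.0)
```

With the values of the example this reproduces, to the rounding used in the text, $w = .89$, a posterior mean of 75.3 and a posterior variance of 0.22.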


However, from the predictive check

$$g(y_d) = \frac{(76 - 70)^2}{2.25} + \frac{3(3.33)}{1} = 26 \qquad (2.5)$$

and

$$\alpha = \Pr\{\chi_4^2 > 26\} < .001. \qquad (2.6)$$

Thus for this example the model, and hence the estimate of $\theta$ supplied by
the posterior distribution N(75.3, 0.22), is discredited by the predictive
check.
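
The overall check (2.3)-(2.6) can be reproduced numerically. The sketch below is one way to do it using only the Python standard library; the helper `chi2_sf` is an assumption of this illustration (it evaluates the chi-square upper tail for integer degrees of freedom through the standard incomplete-gamma recurrence) and is not part of the paper.

```python
import math

def chi2_sf(x, df):
    """Upper tail Pr{chi^2_df > x} for a positive integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # df = 2: Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # df = 1: Q(1/2, z) = erfc(sqrt(z))
    while 2 * a < df:
        # recurrence Q(a + 1, z) = Q(a, z) + z^a exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def overall_check(y, sigma2, theta0, sigma2_theta):
    """g(y_d) of (2.4) and its tail area on n degrees of freedom, as in (2.3)."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    g = (n - 1) * s2 / sigma2 + (ybar - theta0) ** 2 / (sigma2 / n + sigma2_theta)
    return g, chi2_sf(g, n)

g, alpha = overall_check([77, 74, 75, 78], 1.0, 70.0, 2.0)
```

For the data of the example `g` equals 26 and `alpha` falls well below .001, in agreement with (2.5) and (2.6).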

Notice the following: (a) while the posterior distribution combines
information from data and prior in a manner which is entirely appropriate if
the model is to be believed, the predictive distribution contrasts these two
sources of information and checks their compatibility.


(b) The predictive check formalizes questions that any competent statistician
would raise having been presented with the supposed form of the model and the
data. The components of $g(y_d)$, $(n-1)s_d^2/\sigma^2$ and
$(\bar y_d - \theta_0)^2/(\sigma^2/n + \sigma_\theta^2)$, are the standard
sampling theory checking functions for contrasting an estimate of variance with
a prior value and contrasting two estimates of the same mean.

(c)  In  making  this  predictive  check  it  was  not  necessary  to  be  specific  about 
an  alternative  model.  This  issue  is  of  some  importance  for  it  seems  a  matter  of 
ordinary  human  experience  that  an  appreciation  that  a  situation  is  unusual  does  not 
necessarily  depend  on  the  immediate  availability  of  an  alternative. 

(d) Whereas in estimating $\theta$ assuming the model to be true the posterior
distribution makes use only of the single data vector $y_d$ that has actually
occurred, by contrast, an assessment of whether the sample $y_d$ is likely to
have occurred at all is necessarily achieved by relating $y_d$ to a relevant
reference set of all data vectors $y$ which could have occurred with the model
true.

Inspection of the global function $g(y_d)$ alone would rarely ensure adequate
checking of the model. In this example, for instance, it would be natural to
consider the individual contributions from $\bar y_d$ and $s_d^2$, not only so
that they could be separately considered, but also because unusually small
values of $(n-1)s_d^2/\sigma^2$ as well as unusually large ones could point to
model inadequacy. Also if $n$ were larger, we might wish to consider other
functions $g_i(y_d)$ of the data such as moment coefficients and serial
correlation coefficients which could reveal model inadequacies believed
important in the current experimental situation. This could be done by
referring $p\{g_i(y_d) \mid A\}$ to the predictive distribution
$p\{g_i(y) \mid A\}$ derived by appropriate integration of $p(y \mid A)$.
Associated with these more specific checks are (possibly vague) model
alternatives, the logical consequences of which are discussed in Section 4.6.

In practice, criticism of the model is often conducted by visual inspection of
residual displays and other more sophisticated plots. But such a process,
although it is informal, seems to me to fall within the logical framework
described above. The plots are designed to make manifest certain "features" in
the data that would rarely be extreme, if the model were true. If such a
feature can be described by a function $g_i(y_d)$, its unusualness, if
formalized, would be measured appropriately by reference to
$p\{g_i(y) \mid A\}$.

For the above example obvious functions for checking individual features of the
model are $\bar y$, $s^2$ and suitably chosen functions of standardized
residuals $r = (r_1, \ldots, r_n)'$ with $r_i = (y_i - \bar y)/s$,
$i = 1, \ldots, n$. These would usually include the individual residuals
themselves plus other functions which, depending on the context, might include
checks for needed transformation, heteroscedasticity, serial correlation, "bad
values", skewness and kurtosis. See for example Anscombe (1961), Anscombe and
Tukey (1963), Andrews (1971a and b).

The standardized residuals can be expressed more conveniently in terms of
$n - 2$ independently distributed functions obtained by making an orthogonal
transformation from $y$ to $Y = (Y_1, Y_2, \ldots, Y_n)'$ with
$Y_n = \sqrt{n}\,\bar y$ and then transforming to $\bar y$, $s^2$ and $u$,
where $u$ is a vector of $n - 2$ residual quantities
$u = (u_1, u_2, \ldots, u_{n-2})'$ such that

$$u_j = Y_{j+1} \Big/ \Big(\frac{1}{j} \sum_{k=1}^{j} Y_k^2\Big)^{1/2}. \qquad (2.7)$$

The Jacobian of the transformation from $y$ to $\bar y, s^2, u$ is proportional
to $(s^2)^{\frac12(n-1)-1} \prod_{j=1}^{n-2} \{1 + u_j^2/j\}^{-\frac12(j+1)}$.
After transformation the predictive distribution contains $n$ elements
distributed independently

$$p(\bar y, s^2, u \mid A) = p(\bar y \mid A)\, p(s^2 \mid A)\, p(u \mid A) \qquad (2.8)$$

where
where 
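
$$p(\bar y \mid A) \propto (\sigma_\theta^2 + \sigma^2/n)^{-1/2} \exp\{-\tfrac12 (\bar y - \theta_0)^2/(\sigma_\theta^2 + \sigma^2/n)\} \qquad (2.9)$$

$$p(s^2 \mid A) \propto (\sigma^2)^{-\frac12(n-1)} (s^2)^{\frac12(n-1)-1} \exp\{-\tfrac12 (n-1)s^2/\sigma^2\} \qquad (2.10)$$

$$p(u \mid A) \propto \prod_{j=1}^{n-2} \{1 + u_j^2/j\}^{-\frac12(j+1)} \qquad (2.11)$$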
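
To make the construction of the residual quantities concrete, here is a minimal sketch under a stated assumption: the paper requires only some orthogonal transformation whose last element is $\sqrt{n}\,\bar y$, and the Helmert contrasts used below are one such choice, not necessarily the one intended. The function names are illustrative. Each factor of (2.11) has the form of a Student t density with $j$ degrees of freedom, so under this reading each $u_j$ could be referred to a t table.

```python
import math

def helmert_contrasts(y):
    """Orthogonal transform y -> Y whose last element is sqrt(n) * ybar (Helmert rows)."""
    n = len(y)
    Y = []
    for j in range(1, n):
        # row j is (1, ..., 1, -j, 0, ..., 0) / sqrt(j * (j + 1)), with j leading ones
        Y.append((sum(y[:j]) - j * y[j]) / math.sqrt(j * (j + 1)))
    Y.append(sum(y) / math.sqrt(n))
    return Y

def residual_quantities(y):
    """u_j = Y_{j+1} / sqrt(mean of Y_1^2, ..., Y_j^2) for j = 1, ..., n - 2."""
    Y = helmert_contrasts(y)
    u = []
    for j in range(1, len(y) - 1):
        scale = math.sqrt(sum(v * v for v in Y[:j]) / j)   # zero if the leading contrasts vanish
        u.append(Y[j] / scale)                             # Y[j] is Y_{j+1} in 1-based indexing
    return u

Y = helmert_contrasts([77, 74, 75, 78])
u = residual_quantities([77, 74, 75, 78])
```

The first $n - 1$ elements of `Y` carry the residual information: their squares sum to $(n-1)s^2$, which is 10 for the data of the example.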


The unusualness of $g_1 = \bar y$, $g_2 = s^2$ and of any residual functions
of interest $g_3, g_4, \ldots, g_q$ can then be assessed by computing

$$\Pr\{p(g_j \mid A) < p(g_{jd} \mid A)\}, \qquad j = 1, 2, \ldots, q \qquad (2.12)$$

which for unimodal distributions will be tail area probabilities. For this
example these would be obtained by referring

(i) $(\bar y_d - \theta_0)/(\sigma_\theta^2 + \sigma^2/n)^{1/2}$ to the Normal table

(ii) $(n-1)s_d^2/\sigma^2$ to the $\chi^2$ table

(iii) $g_{3d}, \ldots, g_{qd}$ to reference distributions obtained by
appropriate integration of $p(u \mid A)$.
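
Checks (i) and (ii) can be carried out separately for the yield example. The sketch below is an illustration only: it assumes that the variance check is referred to chi-square on $n - 1$ degrees of freedom (the text says only "the $\chi^2$ table"), reports a two-sided Normal tail area for the mean, and uses a hypothetical helper `chi2_sf` built from the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper tail Pr{chi^2_df > x} for a positive integer df (incomplete-gamma recurrence)."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def component_checks(y, sigma2, theta0, sigma2_theta):
    """Separate predictive checks on the sample mean and the sample variance."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    z = (ybar - theta0) / math.sqrt(sigma2_theta + sigma2 / n)   # check (i)
    p_mean = math.erfc(abs(z) / math.sqrt(2.0))                  # two-sided Normal tail area
    q = (n - 1) * s2 / sigma2                                    # check (ii)
    p_var_upper = chi2_sf(q, n - 1)
    return z, p_mean, q, p_var_upper, 1.0 - p_var_upper          # last value is the lower tail

z, p_mean, q, p_var_upper, p_var_lower = component_checks([77, 74, 75, 78], 1.0, 70.0, 2.0)
```

Reporting both tails for the variance reflects the earlier remark that unusually small as well as unusually large values of $(n-1)s_d^2/\sigma^2$ could point to model inadequacy.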

These probabilities are of course affected by transformation. Thus the answer
will be a little different depending for example on whether we ask a question
about $s$ or about $s^2$. I do not find this particularly disturbing. Slightly
different questions can be expected to have slightly different answers. We now
illustrate some implications.

2.1  Significance  tests 

Suppose $\sigma_\theta^2$ is assumed small compared with $\sigma^2/n$, so that
$w$, the relative amount of information supplied by the data, is close to zero.
Then, if this model can be relied upon, the posterior distribution will be
essentially the same as the prior, sharply centered at $\theta_0$. A practical
context is one where the statistician is told that the process mean is known to
be $\theta_0$ and the batch to batch variance $\sigma_\theta^2$ is negligible
compared with testing variance $\sigma^2$. If he believed this model, then any
data $y$ could do very little to change his belief that
$\theta \doteq \theta_0$. However, it could deny the relevance of this model.
In particular $g_1(y_d)$ now involves essentially the reference of
$(\bar y_d - \theta_0)/(\sigma/\sqrt{n})$ to normal tables; the failure of this
check means that the model is discredited and therefore the Bayes calculation
that leads to a sharp posterior distribution at $\theta_0$ may not logically be
undertaken.

The above most satisfactorily explains to me the rationale of a significance
test.

(a) The tentative model (null hypothesis) implies that $\theta$ is close to
$\theta_0$.

(b) A check on the compatibility of this model and the data, so far as the
mean is concerned, is provided by reference of
$(\bar y_d - \theta_0)/(\sigma/\sqrt{n})$ to the Normal Table.

(c) If the tail area probability is not small we do not question the model.
The application of Bayes' theorem then produces a posterior distribution
which is sharply centered at $\theta_0$. We have "no reason to question
the null hypothesis".

(d) If the tail area probability is small we conclude that the model
which postulated that $\theta = \theta_0$ is discredited by the data,
i.e., the "null hypothesis is discredited".

(e) Notice too that although the failure of this check would most
immediately proscribe the use of Bayes' theorem, the failure of other
checks (and of that based on $s^2$ in particular) would also suggest
the need for model modification before proceeding further.

A difficulty that this removes for me is that, as usually formulated,
significance tests had seemed to provide no basis for belief. On this
formulation however the significance test provides a means of discrediting a
model which if accepted would inevitably imply acceptance of the belief that
$\theta$ lay close to $\theta_0$. It is admitted that this formulation does not
cover all possible circumstances in which significance tests have been used
(see in particular Cox (1977)), but it is arguable that other applications are
best dealt with in other ways.


Suppose now that $\sigma_\theta^2$ is assumed large compared with $\sigma^2/n$,
so that $1 - w$, which measures the proportion of the information about
$\theta$ coming from the prior, is close to zero. Then $\sigma_\theta^2$
dominates the denominator in the predictive checking function
$(\bar y - \theta_0)/(\sigma_\theta^2 + \sigma^2/n)^{1/2}$, implying that the
model would not be called into question by sets of data having widely different
sample averages. This is the situation where we can invoke what L. J. Savage
called the "theory of precise measurement" to justify the very useful numerical
approximation of the posterior distribution by $N(\bar y, \sigma^2/n)$. Now
since the predictive distribution for $y$ does not exist at the limit
$1 - w \to 0$ when this limiting posterior distribution is obtained, it might
be argued that, when precise measurement theory is appropriate, we have a
license to apply Bayes' theorem without any restraining checks on the model.
Obviously however in any imaginable experimental situation there would be
values of $y$ which would rightly be regarded as implausible given the
investigator's current beliefs. Thus what is really being verified is that a
non-informative prior must, to make practical sense, always be proper, even
though the appropriate posterior distribution can, in suitable circumstances,
be numerically approximated by substituting an improper prior.
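
The limiting behaviour described here can be written out from the definitions of Section 2. The following is only a sketch of the algebra, in the notation of (2.1), for a fixed sample and $\sigma_\theta^2 \to \infty$:

```latex
\begin{aligned}
w &= \frac{n/\sigma^2}{n/\sigma^2 + 1/\sigma_\theta^2} \;\to\; 1, \\
\bar\theta &= w\bar y_d + (1 - w)\theta_0 \;\to\; \bar y_d, \\
(I_{\bar y} + I_\theta)^{-1} &= \Big(\frac{n}{\sigma^2} + \frac{1}{\sigma_\theta^2}\Big)^{-1} \;\to\; \frac{\sigma^2}{n}, \\
\frac{\bar y_d - \theta_0}{(\sigma_\theta^2 + \sigma^2/n)^{1/2}} &\;\to\; 0 .
\end{aligned}
```

The first three lines give the approximating posterior $N(\bar y_d, \sigma^2/n)$; the last shows why, in the same limit, the check on the mean can no longer call the model into question whatever the sample average.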

3.  The  normal  linear  model 
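Suppose

$$y \sim N(\mathbf{1}\mu + X\theta,\; I_n\sigma^2) \qquad (3.1)$$

with $\mathbf{1}$ a vector of unities and $X$ of full rank $k$ such that
$X'\mathbf{1} = 0$, and suppose that prior densities are locally approximated
by

$$\mu \sim N(\mu_0, c^{-1}\sigma^2), \quad \theta \sim N(\theta_0, \Gamma^{-1}\sigma^2), \quad (\sigma^2/\nu_0 s_0^2) \sim \chi^{-2}(\nu_0) \qquad (3.2)$$

with $\mu$ and $\theta$ independent but conditional on $\sigma^2$, and
$\chi^{-2}(\nu_0)$ the inverted chi square distribution.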


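Given a sample $y_d$, special interest attaches to $\theta$ and $\sigma^2$
which given the assumptions are estimated by $p(\theta, \sigma^2 \mid y_d, A)$
with marginal distributions

$$p(\theta \mid y_d, A) \propto \left\{1 + \frac{(\theta - \bar\theta_d)'(X'X + \Gamma)(\theta - \bar\theta_d)}{(n + \nu_0)\bar s_d^2}\right\}^{-\frac12(n + \nu_0 + k)} \qquad (3.3)$$

$$p(\sigma^2 \mid y_d, A) \propto \sigma^{-(n + \nu_0 + 2)} \exp\{-\tfrac12 (n + \nu_0)\bar s_d^2/\sigma^2\} \qquad (3.4)$$

with

$$\bar\theta_d = (X'X + \Gamma)^{-1}(X'X\hat\theta_d + \Gamma\theta_0), \quad \hat\theta_d = (X'X)^{-1}X'y_d, \quad \nu = n - k - 1,$$

$$(n + \nu_0)\bar s_d^2 = \nu s_d^2 + \nu_0 s_0^2 + (\hat\theta_d - \theta_0)'\{(X'X)^{-1} + \Gamma^{-1}\}^{-1}(\hat\theta_d - \theta_0) + (n^{-1} + c^{-1})^{-1}(\bar y_d - \mu_0)^2.$$

Now let $s_p^2$ be the pooled estimate

$$s_p^2 = (\nu + \nu_0)^{-1}(\nu s^2 + \nu_0 s_0^2). \qquad (3.5)$$
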


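Then the predictive distributions for $(\hat\theta - \theta_0)/s_p$,
$s^2/s_0^2$, and the $\nu - 1$ elements of the residual vector $u$, defined in
an analogous manner to that previously employed in (2.7), are independent and
are given by

$$p((\hat\theta - \theta_0)/s_p \mid A) \propto \left\{1 + \frac{(\hat\theta - \theta_0)'\{(X'X)^{-1} + \Gamma^{-1}\}^{-1}(\hat\theta - \theta_0)}{(\nu + \nu_0)s_p^2}\right\}^{-\frac12(\nu + \nu_0 + k)} \qquad (3.6)$$

$$p(s^2/s_0^2 \mid A) \propto F^{\frac12\nu - 1}\left\{1 + \frac{\nu}{\nu_0}F\right\}^{-\frac12(\nu + \nu_0)}, \qquad F = s^2/s_0^2 \qquad (3.7)$$

$$p(u \mid A) \propto \prod_{j=1}^{\nu - 1} \{1 + (u_j^2/j)\}^{-\frac12(j+1)} \qquad (3.8)$$
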

The predictive check derived from (3.6)

$$\Pr\{p((\hat\theta - \theta_0)/s_p \mid A) < p((\hat\theta_d - \theta_0)/s_{pd} \mid A)\} = \Pr\left\{F_{k,\nu + \nu_0} > \frac{(\hat\theta_d - \theta_0)'\{(X'X)^{-1} + \Gamma^{-1}\}^{-1}(\hat\theta_d - \theta_0)}{k s_{pd}^2}\right\} \qquad (3.9)$$

is the standard analysis of variance check for compatibility of two estimates
$\hat\theta_d$ and $\theta_0$. It was earlier proposed as a check for
compatibility of prior and sample information by Theil (1963). The predictive
check derived from (3.7),
$\Pr\{p(s^2/s_0^2 \mid A) < p(s_d^2/s_0^2 \mid A)\}$, yields the F test having
$\nu$ and $\nu_0$ degrees of freedom appropriate to check whether the two
estimates $s_d^2$ and $s_0^2$ are compatible. Residual checks derived from
(3.8) are obtainable as before.

3.1  Ridge  estimates 

Now suppose the $X$ matrix to be in correlation form and assume
$\theta_0 = 0$, $\Gamma = I\gamma_0$, $\nu_0 \to 0$ so that $s_p^2 \to s^2$.
Then the estimates $\bar\theta_d$ are the ridge estimators of Hoerl and Kennard
(1970) which, given the assumptions, appropriately combine information from the
prior with information from the data. The predictive check (3.9) now yields

$$\Pr\left\{F_{k,\nu} > \frac{\hat\theta_d'\{(X'X)^{-1} + I\gamma_0^{-1}\}^{-1}\hat\theta_d}{k s_d^2}\right\} \qquad (3.10)$$

allowing any choice of $\gamma_0$ to be criticised.
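
The reduction to the ridge form can be checked in one line. As a sketch, taking the posterior mean of the normal linear model to be $\bar\theta_d = (X'X + \Gamma)^{-1}(X'X\hat\theta_d + \Gamma\theta_0)$ with $\hat\theta_d = (X'X)^{-1}X'y_d$, and substituting the assumptions $\theta_0 = 0$ and $\Gamma = I\gamma_0$:

```latex
\begin{aligned}
\bar\theta_d
  &= (X'X + I\gamma_0)^{-1}\,(X'X\,\hat\theta_d + I\gamma_0 \cdot 0) \\
  &= (X'X + I\gamma_0)^{-1}\,X'X\,(X'X)^{-1}X'y_d \\
  &= (X'X + I\gamma_0)^{-1}\,X'y_d ,
\end{aligned}
```

which is the familiar ridge estimator with ridge constant $\gamma_0$. In the same way $\Gamma^{-1} = I\gamma_0^{-1}$ turns the quadratic form of (3.9) into that of (3.10).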

For example, in their original analysis of the data of Gorman and Toman (1966),
Hoerl and Kennard (1970) chose a value $\gamma_0 = .25$. However substitution
of this value in (3.10) yields $\alpha = \Pr\{F_{10,25} > 3.59\} < 0.01$ which
discredits this choice. More recently it has been pointed out (Lindley and
Smith (1972), Hoerl, Kennard and Baldwin (1975)) that given the model,
$\gamma_0$ can be estimated from the data. If we do this, much smaller values
of $\gamma_0$ are obtained which of course are not in conflict with the wider
model.

The two kinds of analysis further illustrate the overlap between predictive
checking and Bayesian estimation later discussed in Section 4.6.

The  Bayes  approach  to  ridge  estimators  has  the  characteristic  advantage  that 
the  somewhat  arbitrary  prior  assumptions,  which  have  to  be  made  even  for  compatible 
values  of  y,  are  uncovered  for  criticism  (see  also  Draper  and  Van  Nostrand  (1977)). 
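
The way the checking statistic in (3.10) responds to the ridge constant can be seen in a small numerical sketch. It assumes the prior precision has the form $I\gamma_0$, uses only two predictors so that the matrix algebra can be written out by hand, and runs on invented data; the function names are hypothetical. The tail area itself is not computed, since that needs an F distribution function from a statistics library.

```python
import math

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(v, m):
    """Quadratic form v' m v for a 2-vector v."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def unit_length(col):
    """Centre a column and scale it to unit length (correlation form)."""
    mean = sum(col) / len(col)
    centred = [c - mean for c in col]
    norm = math.sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]

def least_squares(x1, x2, y):
    """Return (theta_hat, X'X, s^2) for two predictors in correlation form."""
    n, k = len(y), 2
    u1, u2 = unit_length(x1), unit_length(x2)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    r12 = sum(a * b for a, b in zip(u1, u2))
    xtx = [[1.0, r12], [r12, 1.0]]
    xty = [sum(a * b for a, b in zip(u1, yc)),
           sum(a * b for a, b in zip(u2, yc))]
    xtx_inv = inv2(xtx)
    theta = [xtx_inv[i][0] * xty[0] + xtx_inv[i][1] * xty[1] for i in range(2)]
    rss = sum((v - theta[0] * a - theta[1] * b) ** 2
              for v, a, b in zip(yc, u1, u2))
    return theta, xtx, rss / (n - k - 1)

def ridge_check_statistic(theta, xtx, s2, gamma0):
    """theta' {(X'X)^-1 + I/gamma0}^-1 theta / (k s^2), to be referred to F(k, nu)."""
    xtx_inv = inv2(xtx)
    inner = [[xtx_inv[i][j] + (1.0 / gamma0 if i == j else 0.0)
              for j in range(2)] for i in range(2)]
    return quad(theta, inv2(inner)) / (2 * s2)

# Hypothetical data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [3.1, 2.9, 6.2, 5.8, 9.1, 8.7, 12.4, 11.6]

theta, xtx, s2 = least_squares(x1, x2, y)
for gamma0 in (0.01, 0.25, 1.0, 1e9):
    print(gamma0, ridge_check_statistic(theta, xtx, s2, gamma0))
```

In this sketch the statistic grows with $\gamma_0$, so a sharper prior centred at zero is more easily discredited by the data, and for very large $\gamma_0$ it approaches the usual regression F ratio $\hat\theta'X'X\hat\theta/(ks^2)$.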

If $\gamma_0^{-1} \to 0$, (3.10) yields the standard ANOVA significance test which has a detailed
interpretation parallel to that set out in Section 2.1.

4. Diagnostic checks


It is useful to distinguish two kinds of checks which may be called respectively
Overall or Multidirectional checks and Specific or Unidirectional checks. An example
of the first would be a general inspection of residuals and of the second a Durbin-Watson
test for first order serial correlation. This distinction is made for example by
Box and Jenkins (1970) in their discussion of the general philosophy of diagnostic
checking. Concerning these two kinds of checks these authors say "... although
[overall checks] can point out unsuspected peculiarities ... [they] may
not be particularly sensitive. Tests for specific departures ... are more sensitive,
but may fail to warn of trouble other than that specifically anticipated." The two
alternatives ought properly to be regarded as extremes on some scale of dependence of
checking procedures on specific alternatives. For example consider the fitting of a
parametric time series model. While residuals themselves should always be inspected
there are a number of way-stations between this overall but insensitive check and the
device called "overfitting" in which a model is tentatively elaborated in a specific
direction. Thus inspection of, and application of overall tests to, the autocorrelation
function and the periodogram of the residuals while still non-specific
is less general than the first device and much less specific than the second.


The model checking problem is comparable to that faced by a nation which fears
aerial attack that might come from any direction but with certain
rather wide zones more likely than others and certain specific directions believed
especially likely. How should limited radio detection devices, which are less sensitive
the less they are focused, be deployed? The best solution obviously involves
some combination of wide and more specific searches, and theoretically could be
achieved knowing prior probabilities and expected losses. Correspondingly,
the competent statistician must, in a variety of contexts, be able
to make intelligent guesses not only of what discrepancies are
particularly likely, but which are potentially disastrous, and to allocate his effort
accordingly. In practice this is done informally and is part of what an adequate
training in statistics achieves.

4.1  Checking  parametric  features  of  the  model. 

In the examples considered above where sufficient statistics were available
parameter preferences evidenced by proper priors were directly challenged, leading
without a direct statement of alternatives to appropriate checks. When a specific
set of assumptions $A_1$ alternative to $A_0$ is in mind then an appropriate checking
function might also be obtained from the predictive ratio

$$ p(y \mid A_1)/p(y \mid A_0). \qquad (4.1) $$

We shall not explore this possibility further here except to note that this ratio is
a component in the direct assessment of Bayesian odds to which we refer briefly in
Section 4.6.


4.2  Checking  residual  features  of  the  model. 

Residual checking functions are sometimes chosen on an ad hoc basis and sometimes
using specific models. I think the best course is again to employ an iteration,
this time between theory and intuition. An empirical procedure that works well
invites the question: What kind of model would be needed for its justification?

Such a model can then be considered for use in a wider context. For instance
exponential smoothing and the "three term" controller were both empirically developed
techniques found to be practically effective. ARIMA time series models are generalizations
of the stochastic processes that could justify these methods (Box and
Jenkins (1970)). In a similar way the practical usefulness of such things as the
jackknife and cross validation implies the existence of corresponding models which
are worthy of further analysis.

The  distinction  between  parametric  features  of  the  model  and  residual  features 
is  of  course  arbitrary  and  a  matter  of  convenience.  In  practice  the  needs  of 
parsimony  urge  us  to  settle  for  reasonably  simple  models  and  to  consider  possible 
deviations  from  them.  Consider  now  therefore  an  interesting  but  by  no  means  unique 
method  for  obtaining  an  appropriate  function  of  the  data  for  informal  or  formal 
checking  for  a  particular  kind  of  deviation  from  a  current  model  parametrized  by  a 
discrepancy parameter $\beta$.

Suppose the predictive distribution conditional on some specific choice of $\beta$
is $p(y \mid \beta)$. Then a scaleless function of the data alone, appropriate to measure
discrepancies from the value $\beta_0$ taken in the current model is provided by Fisher's
score function

$$ g_\beta(y) = \left.\frac{\partial \ln p(y \mid \beta)}{\partial \beta}\right|_{\beta = \beta_0}. \qquad (4.2) $$

We illustrate by considering some possible discrepancies from the standard
normal linear model. First consider the model when there is no discrepancy so that
$\beta = \beta_0$, and using the structure of (3.1) write

$$ \theta' = (\theta_0 \,\vdots\, \theta'), \qquad X = (1 \,\vdots\, X) \qquad (4.3) $$

so that $\theta$ and $X$ henceforth include the constant term and $\theta$ has $k + 1$ elements.
For simplicity we here suppose that the distributions of $\theta$ and $\ln\sigma$ are locally
flat a priori so that $p(\theta, \sigma) \propto \sigma^{-1}$. Then $p(y \mid \beta_0)$ is locally approximated
by the singular distribution


$$ p(y \mid \beta_0) \propto S^{-\nu} \qquad (4.4) $$

where $S^2 = \sum_{i=1}^{n}(y_i - \hat y_i)^2 = y'Ry$ and $R = I - M$ with $M = X(X'X)^{-1}X'$. If we
transform to $\hat\theta$, $S$, and $u$ then the standardized residuals $u$ which are functions
of $\nu - 1$ angles are distributed as in (3.8) and,

$$ p(\hat\theta, S, u \mid \beta_0) \propto S^{-1}p(u \mid \beta_0) \qquad (4.5) $$


To see the reasonableness of this set-up notice that by invocation of the
linear model the investigator in effect predicts that the sample point $y$ will lie
somewhere close to a hyperplane $h$ spanned by the columns of $X$. The formulation
above interprets "somewhere close to" as follows. Consider a future sample $y$ in
relation to $(\hat\theta, S)$ where $\hat\theta$ are the $k + 1$ coordinates of the projection $\hat y$ of
$y$ on $h$ and $S$ is the perpendicular distance of $y$ from $h$. Equation (4.5)
says that locally any value of $\hat\theta$ is equally acceptable but that the density for
the distance $S$ will fall off inversely with $S$.

To obtain $g_\beta(y)$ we need to determine how $p(y \mid \beta)$ depends on the discrepancy
parameter $\beta$ in the neighborhood of $\beta = \beta_0$.

4.3  A  check  for  needed  power  transformation. 

Especially when $y_{\max}/y_{\min}$ is large some transformation of the data, for
example $y^{(\lambda)} = (y^\lambda - 1)/\lambda$, might permit closer representation. Following the
approximate argument of Box and Cox (1964), with $\lambda$ the discrepancy parameter and
$\dot y$ the geometric mean of the $y$'s, and for $\lambda$ close to 1,

$$ p(y \mid \lambda) \propto \dot y^{\,\nu(\lambda - 1)}\, Q_\lambda^{-\frac{1}{2}\nu} \qquad (4.6) $$

where the omitted constant does not depend on $y$ or on $\lambda$ and where
$Q_\lambda = y^{(\lambda)\prime} R\, y^{(\lambda)}$. Then

$$ g_\lambda(y) = \left.\frac{\partial \ln p(y \mid \lambda)}{\partial \lambda}\right|_{\lambda = 1} = z'Ry/s^2 = \frac{1}{s}\sum_{i=1}^{n} z_i r_i \qquad (4.7) $$

where $z_i = y_i\{1 - \ln(y_i/\dot y)\}$, $s^2 = y'Ry/\nu$, and $r_i = (y_i - \hat y_i)/s$.

The predictive check may thus be performed by regressing the residuals $y - \hat y$ on
the residuals $z - \hat z$ of the constructed variable $z = y\{1 - \ln(y/\dot y)\}$, which accords
with a proposal of Atkinson (1973). The check can be made informally by plotting one set
of residuals versus the other. More formally the distribution of $g_\lambda(y)$ is not
precisely known although an approximate level can be obtained by computer simulation.
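
As a rough numerical illustration of (4.7), the sketch below computes $g_\lambda(y)$ for a straight-line fit and compares it with a finite-difference derivative of the log of (4.6) at $\lambda = 1$. The data, the one-regressor model and the function names are assumptions made for the example only.

```python
import math

def line_residuals(x, v):
    """Residuals from a least squares straight-line fit of v on x, with intercept."""
    n = len(x)
    xbar = sum(x) / n
    vbar = sum(v) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - vbar) for a, b in zip(x, v)) / sxx
    return [(b - vbar) - slope * (a - xbar) for a, b in zip(x, v)]

def geometric_mean(y):
    return math.exp(sum(math.log(v) for v in y) / len(y))

def transformation_score(x, y):
    """g_lambda(y) = (1/s) * sum z_i r_i, following the form of (4.7)."""
    nu = len(y) - 2                                   # n - k - 1 with k = 1 regressor
    gm = geometric_mean(y)
    resid = line_residuals(x, y)                      # y - y_hat
    s = math.sqrt(sum(e * e for e in resid) / nu)
    z = [v * (1.0 - math.log(v / gm)) for v in y]     # constructed variable
    r = [e / s for e in resid]                        # standardized residuals
    return sum(zi * ri for zi, ri in zip(z, r)) / s

def log_predictive(x, y, lam):
    """ln p(y | lambda) up to an additive constant, following the form of (4.6)."""
    nu = len(y) - 2
    gm = geometric_mean(y)
    y_lam = [(v ** lam - 1.0) / lam for v in y]
    q = sum(e * e for e in line_residuals(x, y_lam))  # Q_lambda
    return nu * (lam - 1.0) * math.log(gm) - 0.5 * nu * math.log(q)

# Hypothetical positive responses, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.8, 8.2, 12.9, 21.0, 33.5]

g = transformation_score(x, y)
h = 1e-5
numeric = (log_predictive(x, y, 1 + h) - log_predictive(x, y, 1 - h)) / (2 * h)
print(g, numeric)
```

The two printed values should agree closely, which is a useful check that the constructed variable $z$ has been formed with the geometric mean rather than the arithmetic mean.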

Relation to other proposed checks

Related checks proposed by Tukey (1949) and by Andrews (1971a) correlate the
original residuals with those from the constructed variables $(\hat y - \bar y)^2$ and $y \ln y$
respectively. Both possess the advantage of having exactly known sampling distributions.

For illustration we consider

(a) the biological data of Box and Cox (1964), for which they recommend a
reciprocal transformation,

(b) the trapping data of Snedecor and Cochran (1967), for which they recommend
a log transformation.

Figures 2(a) and (b) show plots of residuals $y - \hat y$ against residuals from
$y\{1 - \ln(y/\dot y)\}$ and $-(\hat y - \bar y)^2$. The correlation coefficient for the latter transforms
directly to give Tukey's one degree of freedom for non-additivity. Plots for the
constructed variable $y \ln y$ are not shown since they are essentially identical to
the Tukey plot. The relationship between these various procedures can be seen by
noting that $z = y\{1 - \ln(y/\dot y)\}$ may be closely approximated by


$$ z \approx \dot y - \frac{(y - \dot y)^2}{2\dot y}. \qquad (4.8) $$

Thus after writing

$$ \frac{y_i - \dot y}{s} = \frac{y_i - \hat y_i}{s} + \frac{\hat y_i - \bar y}{s} + \frac{\bar y - \dot y}{s} = r_i + Y_i + d \qquad (4.9) $$

and noting that $\sum r_i = \sum r_i Y_i = 0$ and $\sum r_i^2 = \nu$, we have approximately

$$ g_\lambda(y) \propto -\sum (r_i + Y_i + d)^2 r_i = -\left\{\sum r_i^3 + 2\sum r_i^2 Y_i + \sum r_i Y_i^2 + 2\sum r_i^2 d\right\} \qquad (4.10) $$

$$ g_\lambda(y) \propto -(T_{30} + 2T_{21} + T_{12} + 2\nu d) \qquad (4.11) $$

where the $T_{ij}$ are checking functions proposed by Anscombe (1961) and Anscombe and
Tukey (1963). See also Box and Cox (1964). In particular $T_{12}$ is the component
associated with Tukey's one degree of freedom for non-additivity. The approximation
shows how $g_\lambda(y)$ jointly employs skewness ($T_{30}$), dependence of variance on level
($T_{21}$), as well as transformable non-additivity ($T_{12}$) to indicate the need for
transformation.

The Box and Cox data were generated by a 3 × 4 factorial with four-fold replication
supplying a good deal of information about the variance as a function of
location. It is not surprising therefore (see also Atkinson (1973)) that for this
example $g_\lambda(y)$ is considerably more sensitive than $T_{12}$ (or almost equivalently,
than Andrews' criterion) as a measure of the need for transformation. By contrast
the Snedecor and Cochran data is from an unreplicated 3 × 5 arrangement where most
of the information comes from $T_{12}$ measuring non-additivity.


4.4 A check for serial correlation

For data known or suspected to have been taken in a specific serial order in
time or space, a model that permitted the errors to follow a first-order autoregressive
process with parameter $|\phi| < 1$ might provide an improved approximation. The
dispersion matrix for the n-dimensional vector of errors $\varepsilon$ would then be $W_\phi^{-1}\sigma^2$,
where $W_\phi$ is a symmetric continuant with principal diagonal
$\{1, 1 + \phi^2, 1 + \phi^2, \ldots, 1 + \phi^2, 1\}$ and with all the elements of super- and sub-diagonals
equal to $-\phi$. Thus in particular $W_0 = I$. Then

$$ p(y \mid \phi) \propto (1 - \phi^2)^{\frac{1}{2}}\, Q_\phi^{-\frac{1}{2}\nu} \qquad (4.12) $$

where $Q_\phi = y'R_\phi y$ and $R_\phi = W_\phi - M_\phi$ with $M_\phi = W_\phi X(X'W_\phi X)^{-1}X'W_\phi$. Then with

$$ \left.\frac{\partial W_\phi}{\partial \phi}\right|_{\phi = 0} = -C $$

where $C$ is $n \times n$ with unities in super and sub-diagonals and
zeros elsewhere, after some algebraic manipulation

$$ g_\phi(y) = \left.\frac{\partial \ln p(y \mid \phi)}{\partial \phi}\right|_{\phi = 0} = \frac{y'RCRy}{2s^2} \qquad (4.13) $$

where $R = R_0$. Thus

$$ g_\phi(y) = \sum_{i=1}^{n-1} r_i r_{i+1} \qquad (4.14) $$

which is a multiple of the sample first lag autocorrelation of the residuals from the
fitted model. This points to the sensitive graphical diagnostic procedure of plotting
residuals $r_{i+1}$ against $r_i$ and yields the standard checking function of Durbin
and Watson (1950).
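
A minimal sketch of this check, assuming a straight-line trend model and an invented series, computes $g_\phi(y) = \sum r_i r_{i+1}$ from the standardized residuals alongside the Durbin-Watson ratio. The function names are hypothetical.

```python
import math

def trend_residuals(y):
    """Residuals from a least squares straight line fitted against the time index."""
    n = len(y)
    t = list(range(n))
    tbar = sum(t) / n
    ybar = sum(y) / n
    stt = sum((a - tbar) ** 2 for a in t)
    slope = sum((a - tbar) * (b - ybar) for a, b in zip(t, y)) / stt
    return [(b - ybar) - slope * (a - tbar) for a, b in zip(t, y)]

def serial_checks(y):
    """Return (g_phi, Durbin-Watson d, nu) for a straight-line fit."""
    e = trend_residuals(y)
    nu = len(y) - 2                                   # residual degrees of freedom
    rss = sum(v * v for v in e)
    s = math.sqrt(rss / nu)
    r = [v / s for v in e]                            # standardized residuals
    g_phi = sum(a * b for a, b in zip(r, r[1:]))      # sum of r_i * r_{i+1}
    dw = sum((b - a) ** 2 for a, b in zip(e, e[1:])) / rss
    return g_phi, dw, nu

# Hypothetical series with slowly drifting errors, for illustration only.
series = [10.2, 10.9, 11.8, 12.1, 12.0, 12.2, 12.9, 14.1, 15.3, 16.0, 16.1, 16.0]
g_phi, dw, nu = serial_checks(series)
print(g_phi, dw)
```

The two quantities carry the same information apart from end effects: expanding the squared differences gives $d = 2 - (e_1^2 + e_n^2)/\sum e_i^2 - 2g_\phi/\nu$, where $e_i$ are the unstandardized residuals.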

4.5  A  check  for  bad  values 

Competent investigators have over the centuries treated data as possibly containing
atypical values, see for example Stigler (1973). This implies that they would not really
have believed standard textbook models of the kind $y_i = f(\theta, x_i) + \varepsilon_i$ ($i = 1, 2, \ldots, n$)
which state that the same structure is appropriate for every one of a sample of n
observations.

When it is unknown which observations are dubious a more credible "contaminated"
model proposed by Jeffreys (1932), Dixon (1953) and by Tukey (1960) supposes that
there is a probability $\alpha$ that any given observation is "bad" (cannot be represented
by the ideal model). Given $\alpha$, let $p(y \mid \alpha)$ be the predictive distribution and let
$p(b \mid \alpha)$ denote the probability of getting b bad values, then (Box and Tiao (1968b),
Bailey and Box (1980a))


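$$ p(y \mid \alpha) = \sum_{b=0}^{n}\binom{n}{b}\alpha^b(1 - \alpha)^{n-b}p(y \mid b) \qquad (4.15) $$

and

$$ g_\alpha(y) = \left.\frac{\partial \ln p(y \mid \alpha)}{\partial \alpha}\right|_{\alpha = 0} = n\left\{\frac{p(y \mid b = 1)}{p(y \mid b = 0)} - 1\right\}. \qquad (4.16) $$

Now let $z_i$ indicate that the ith observation is bad, then

$$ p(y \mid b = 1) = \frac{1}{n}\sum_{i=1}^{n}p(y \mid z_i) \qquad (4.17) $$

so that

$$ g_\alpha(y) = \frac{\sum_{i=1}^{n}p(y \mid z_i)}{p(y \mid b = 0)} - n. $$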
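
The step from the mixture (4.15) to the closed form (4.16) can be checked numerically. The sketch below uses invented values of $p(y \mid b)$ and hypothetical function names, and compares the closed form with a one-sided finite difference, one-sided because $\alpha = 0$ lies on the boundary of the parameter space.

```python
import math

def predictive(alpha, p_given_b):
    """p(y | alpha) as the binomial mixture over the number b of bad values, as in (4.15)."""
    n = len(p_given_b) - 1
    return sum(math.comb(n, b) * alpha ** b * (1.0 - alpha) ** (n - b) * p
               for b, p in enumerate(p_given_b))

def score_at_zero(p_given_b):
    """n * {p(y | b = 1) / p(y | b = 0) - 1}, the closed form of (4.16)."""
    n = len(p_given_b) - 1
    return n * (p_given_b[1] / p_given_b[0] - 1.0)

# Hypothetical values of p(y | b) for b = 0, ..., 4 (so n = 4).
p_given_b = [2.0, 3.5, 1.0, 0.5, 0.2]

h = 1e-7
one_sided = (math.log(predictive(h, p_given_b))
             - math.log(predictive(0.0, p_given_b))) / h
print(score_at_zero(p_given_b), one_sided)
```

Only the terms for $b = 0$ and $b = 1$ contribute to the derivative at $\alpha = 0$; the values supplied for $b \ge 2$ have no effect on the score.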

Depending on experimental circumstances, there are a variety of ways in which bad
values might be modeled. In particular, contamination could come from increased error
variance, unknown bias, and mistaken sign. The last possibility was suggested by
Barnard (1978) to account for two suspiciously large outliers in Darwin's data on
cross and self fertilized plants, quoted by Fisher (1935).

For illustration consider the first possibility. With one bad value, suppose
the error covariance matrix $W_i^{-1}\sigma^2$ has all diagonal elements equal to $\sigma^2$, except for the
ith element which is equal to $k^2\sigma^2$ $(k > 1)$. Then
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0  - 
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where 


$$ \nu s^2 = y'Ry, \qquad \nu s_i^2 = y'R_iy = y'\{W_i - W_iX(X'W_iX)^{-1}X'W_i\}y \qquad (4.19) $$

where $W_i = I - qG_i$, $q = 1 - k^{-2}$ and $G_i$ is an $n \times n$ matrix with a single unity
for the ith diagonal element and all other elements zero. Now

$$ \nu s_i^2 = \nu s^2 - \frac{q}{1 - qv_i}(y_i - \hat y_i)^2 $$

where $v_i = \mathrm{Var}(\hat y_i)/\sigma^2 = x_i'(X'X)^{-1}x_i$ and $y_i - \hat y_i$ is the ith residual from the ideal
model fit, $y - \hat y = Ry$.
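
The practical point of the last relation is that the downweighted residual sum of squares needs no refitting. A small check, assuming the relation takes the form $\nu s_i^2 = \nu s^2 - q(y_i - \hat y_i)^2/(1 - qv_i)$, a straight-line model, and invented data (the function names are hypothetical), compares a direct weighted fit with this shortcut.

```python
def weighted_line_rss(x, y, w):
    """Minimum of sum w_j (y_j - a - b x_j)^2 over the intercept a and slope b."""
    sw = sum(w)
    xbar = sum(wj * xj for wj, xj in zip(w, x)) / sw
    ybar = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - xbar) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - xbar) * (yj - ybar) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return sum(wj * ((yj - ybar) - slope * (xj - xbar)) ** 2
               for wj, xj, yj in zip(w, x, y))

def shortcut_rss(x, y, i, q):
    """nu*s^2 - q (y_i - yhat_i)^2 / (1 - q v_i), using only the unweighted fit."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    slope = sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y)) / sxx
    resid = [(yj - ybar) - slope * (xj - xbar) for xj, yj in zip(x, y)]
    v_i = 1.0 / n + (x[i] - xbar) ** 2 / sxx      # x_i'(X'X)^-1 x_i for a straight line
    return sum(e * e for e in resid) - q * resid[i] ** 2 / (1.0 - q * v_i)

# Hypothetical data with one suspicious response, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1, 13.8, 16.2]
k = 5.0
q = 1.0 - k ** -2

for i in range(len(y)):
    w = [1.0 - q if j == i else 1.0 for j in range(len(y))]
    print(i, weighted_line_rss(x, y, w), shortcut_rss(x, y, i, q))
```

With $q = 1$ the same expression reduces to the familiar formula for the residual sum of squares after deleting the ith observation.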


Thus  finally  g  (y)  »  x 

Cl  - 


■if  n  V- 

fc-T-i) D  - n  where 
>  •  r{‘  •  TiT^V  '!}" 


(4.20) 


This is the simplest form for computation. The nature of this checking function $D$
can however be more clearly seen by writing it in terms of the weighted residuals
$r_i^* = (y_i - \hat y_i^*)/s_i$, where $\hat y_i^*$ and $s_i$ are the fitted value and standard deviation
recomputed with the ith observation downweighted. Thus

$$ D = \sum_{i=1}^{n}\left\{1 + \frac{q(1 - qv_i)\,r_i^{*2}}{\nu}\right\}^{\frac{1}{2}\nu} $$

Thus $D$ is proportional to the sum of the reciprocals of the n residual t ordinates
obtained by downweighting (omitting as $q \to 1$) each observation in turn and
recomputing the fitted value $\hat y_i$ and the standard deviation $s_i$.

The situation of most interest is when $k$ is large (say $k \ge 5$). Then $q$
approaches unity and the check may be carried out by calculating
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f,  _  ri  1  -  V. 
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(4.21) 


Equation (4.16) brings out a feature of the checking function (4.2) which can be a
disadvantage. Differentiation at $\alpha = 0$ on the boundary of the parameter space
ensures that only the possibility of one bad value is taken account of. Thus as is
clear from (4.21) $D$ in its present form would not be expected to be sensitive to
the occurrence of two or more bad values. Thus with $k = 5$, we obtain the value
$D = 59.05$ for Darwin's data. A Monte Carlo study with 5000 samples of 15 observations
shows that this value would be exceeded by chance about 14% of the time, which
hints only mildly at inadequacy in the standard model, confirming $D$'s insensitivity
for this example.

4.6  Bayesian  options  for  specific  alternatives. 

When concrete alternatives are in mind, Bayesian options are available. In
particular the predictive ratio $p(y \mid A_1)/p(y \mid A_0)$, mentioned earlier, is a component
in the posterior odds ratio which with suitable priors might be used to assess
directly the relative evidence for one model versus another. Also $g_\beta(y)$ of (4.2)
has a Bayesian interpretation for, if corresponding to some discrepancy parameter $\beta$,
the prior distribution $p(\beta)$ was locally flat then the posterior distribution
$p(\beta \mid y_d)$ would be proportional to the predictive density $p(y_d \mid \beta)$ regarded as a
function of $\beta$. Furthermore if that posterior distribution was approximated by a
normal distribution, then


(4.22) 


and a second differentiation would produce a standardized variate.
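
The link between the score and a standardized variate can be worked through explicitly. The symbols $\hat\beta$ and $\sigma_\beta$ below, for the mean and standard deviation of the approximating normal posterior, are notation assumed for this sketch and are not taken from the text.

```latex
\ln p(\beta \mid y_d) \approx \text{const} - \frac{(\beta - \hat\beta)^2}{2\sigma_\beta^2}
\quad\Longrightarrow\quad
g_\beta(y_d) = \left.\frac{\partial \ln p(y_d \mid \beta)}{\partial \beta}\right|_{\beta = \beta_0}
\approx \frac{\hat\beta - \beta_0}{\sigma_\beta^2},
\qquad
\frac{\partial^2 \ln p(y_d \mid \beta)}{\partial \beta^2} \approx -\frac{1}{\sigma_\beta^2},
\qquad
\frac{g_\beta(y_d)}{\sqrt{-\partial^2 \ln p(y_d \mid \beta)/\partial \beta^2}}
\approx \frac{\hat\beta - \beta_0}{\sigma_\beta}.
```

Under these assumptions the score is the distance of the posterior mean from $\beta_0$ measured in units of the posterior variance, and dividing by the square root of the negative second derivative converts it to units of the posterior standard deviation.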

The relation shows how any specific predictive check $g_\beta(y)$ is linked to a
posterior distribution. In particular, considering the illustrative examples of
the last section, the marginal posterior distribution for $\lambda$ was given by Box and
Cox (1964), for $\phi$ by Zellner and Tiao (1964), for the ridge regression parameters
by Lindley and Smith (1972) and that for $\alpha$ may be obtained using the results of
Box and Tiao (1968b) and Bailey and Box (1980a).

Before leaving the topic of diagnostic checks two final points need to be made:
(i) The above discussion illustrates the "overlap" previously mentioned when
specific alternatives are in mind. It does not however establish the omnipotence
of purely Bayesian inference. However far the process of model elaboration is taken
by Bayesian methods the final model involving say the mth set of assumptions
$A_m$ can still be factored

$$ p(y, \theta \mid A_m) = p(\theta \mid y, A_m)\,p(y \mid A_m) \qquad (4.23) $$

thus there always remains an unexplored n-dimensional predictive distribution
$p(y \mid A_m)$ in relation to which a small relative value for $p(y_d \mid A_m)$ could, on a
sampling theory argument, discredit the assumptions on which the Bayes analysis was
conditional. The same is true of the more plausible of two models chosen using a
posterior odds ratio.

(ii) In addition to possible discrepancies to which we have been alerted by
experience, other features may appear pointing to inadequacies of a kind not
previously suspected. This possibility has sometimes proved perplexing,
for while on the one hand the truly unexpected could point the way to
precious new knowledge, on the other, associated probabilities would be indeterminate
because of the uncountable character of other features that might also have been
regarded as surprising. I think the calculation which ignores this difficulty of
indeterminate selection is still worth making, for at least it helps to correct a
misjudgement of something that appears unusual but really is not. For example,
Feller (1968) shows that for a random group of 30 people, the probability that at
least two have coincident birthdays is over 70%; this tells us we need look no further
for an explanation when we are surprised to find two such people at a party.
While the proposed policy will lead to the too frequent pursuit of nonexistent
assignable causes, the iterative process will quickly terminate this chase.
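
The birthday figure is easily reproduced. The short sketch below assumes 365 equally likely birthdays and independence between people; the function name is hypothetical.

```python
def shared_birthday_probability(group_size, days=365):
    """Probability that at least two of group_size people share a birthday."""
    p_all_distinct = 1.0
    for i in range(group_size):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(shared_birthday_probability(30))
```

For a group of 30 the result is a little over 0.70, in line with the figure quoted from Feller.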

5. Robust estimation.

Efficient iterative model building requires both diagnostic checking and model
robustification, where by robustification I mean judicious and grudging elaboration of the
model to ensure against particular hazards (see also Box (1979)). Robustification
becomes necessary when it is known that likely, but not easily detectable, model
discrepancies can yield badly misleading analyses. It is well known, for example,
that least squares analysis can be dramatically affected by moderate serial
correlation of errors.

Recently the serious consequences of bad values on standard least squares
analysis have been especially emphasized and numerous authors have proposed methods
which rely on abandonment or modification of classical estimation methods. In
discussing the rationale for this approach Huber (1977) says "The traditional approach
to theoretical statistics was and is to optimize an idealized model and then to rely
on a continuity principle: what is optimal at the model should be optimal nearby.
Unfortunately, this reliance on continuity is unfounded: the classical optimal
procedures tend to be discontinuous in the statistically meaningful topologies."

He then quotes a motivating example given by Tukey (1960), who pointed out for
example that if a normal distribution were very mildly contaminated with another
which is centrally located but of larger variance, then the sample standard deviation
could be a very poor estimate of scale. Tukey's contribution was remarkable because it
had previously gone unnoticed that the assumption that the same structure must apply
to every observation $y_i$ ($i = 1, 2, \ldots, n$) with absolute certainty ($1 - \alpha = 1$), not
only was unrealistic (since no responsible investigator would make the claim that
inadvertent bad values were impossible), but also could have serious consequences.
While Huber goes on to say that typical 'good data' samples in the physical sciences
appear to be well modeled by this contaminated normal model, he does not develop
methods based on this more realistic set up. This is presumably because his objection
would apply equally to the new as well as to the old model.


I do not agree that the example would support a thesis of the need to abandon
model-based procedures. A model that omits the parameter $\alpha$ is, of course, the
same as one that includes it but sets its value exactly to zero. A value of $\alpha = 0$,
which allows no possibility whatever for bad values, and a value of $\alpha = 0.001$ are,
I think, not close in any statistically meaningful topology. Although 0.001 may look
close to zero, an odds ratio of 0.999/0.001 = 999 for a "good" to a "bad" value is
obviously very different from one of infinity. Such differences in probability
distinguish, for example, a lifeless world in which no evolution could possibly occur
from the one we live in.

The proper conclusion to draw from Tukey's example is, I think, that for many
practical situations in which occasional bad values are to be expected the standard
linear model provides an inadequate approximation that is potentially misleading
and therefore the model should be appropriately changed to approximate what is
believed rather than what is not. The situation is logically the same for a model
that implicitly insists there can be no serial correlation, when data have in
fact been collected serially, or that no transformation of $y$ could be needed when
$y_{\max}/y_{\min}$ is large. As in the classical Stein problem if we know something a priori
it may be disastrous to omit it. On this view for robust estimation of the parameters
of interest we should modify the model which is at fault, rather than the
method of estimation which is not.

5.1  Bayesian  robust  estimation. 

As was argued for example by Box and Tiao (1964), all relevant aspects of the
problem are brought out in an appropriate Bayes analysis. Supposing that $\theta$ has
the same physical interpretation for all $\beta$ then estimation of $\theta$ which is robust
relative to the discrepancy parameter $\beta$ is supplied by the posterior distribution

$$ p(\theta \mid y) = \int p(\theta \mid \beta, y)\,p_u(\beta \mid y)\,p(\beta)\,d\beta \qquad (5.1) $$

This expression contains three key elements that repay individual study:

(a) the sensitivity of inferences about $\theta$ to changes in $\beta$ is reflected by
$p(\theta \mid \beta, y)$ considered as a function of $\beta$;

(b) the information about $\beta$ coming from the data itself is reflected in the
pseudo-likelihood

$$ p_u(\beta \mid y) = p(\beta \mid y)/p(\beta) \propto p(y \mid \beta); \qquad (5.2) $$

(c) the probability of occurrence of different values of $\beta$ in the real world
is represented by $p(\beta)$ which can be chosen to approximate what is
believed or feared.
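
How these three elements combine in (5.1) can be sketched on a grid of values of the discrepancy parameter. The example below is a toy calculation: the conditional means and standard deviations of $\theta$ given $\beta$, and the weights standing for $p_u(\beta \mid y)p(\beta)$, are invented numbers, and the function name is hypothetical.

```python
import math

def marginal_mean_sd(cond_means, cond_sds, weights):
    """Mean and sd of the mixture sum_j w_j * p(theta | beta_j, y), with w normalized."""
    total = sum(weights)
    w = [v / total for v in weights]
    mean = sum(wi * m for wi, m in zip(w, cond_means))
    second_moment = sum(wi * (s * s + m * m)
                        for wi, m, s in zip(w, cond_means, cond_sds))
    return mean, math.sqrt(second_moment - mean * mean)

# Hypothetical grid of discrepancy values and conditional summaries.
beta_grid = [0.0, 0.2, 0.4, 0.6, 0.8]
cond_means = [1.00, 0.80, 0.55, 0.35, 0.20]    # E(theta | beta, y): sensitivity, element (a)
cond_sds = [0.10, 0.10, 0.11, 0.12, 0.14]      # sd(theta | beta, y)
weights = [0.02, 0.10, 0.35, 0.40, 0.13]       # p_u(beta | y) * p(beta): elements (b) and (c)

mean, sd = marginal_mean_sd(cond_means, cond_sds, weights)
print(mean, sd)
```

The marginal standard deviation exceeds a typical conditional one whenever the conditional mean moves with $\beta$, which is how uncertainty about the discrepancy parameter is carried into the inference about $\theta$.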

This route was used to explore deviations from the standard normal model for a
particular class of heavy-tailed distributions by Box and Tiao (1962, 1964); for the
contaminated model of Section 4.5 by Jeffreys (1932) and Box and Tiao (1968b); for a serial
correlation model by Zellner and Tiao (1964); for a transformation problem by Box and Cox (1964).
Notice that using this approach the parameters $\theta$ of interest are completely estimated
in the sense that their distribution rather than merely a point estimate is
available. Also the various elements of $p(\theta \mid y)$ which can be studied individually
can provide a deep understanding of each robustness problem. A particularly informative
display shows contours of the joint distribution $p_u(\theta, \beta \mid y)$ for some parameter
$\theta$ of interest and the discrepancy parameter $\beta$ together with the marginal distribution
$p_u(\beta \mid y)$. When a less prodigal display is necessary the mean and standard
deviation of $p(\theta \mid \beta, y)$ may be shown with $p_u(\beta \mid y)$. For illustration we consider
some serial data analyzed by Coen, Gomme and Kendall (1969). They regressed
quarterly values of the Financial Times Share Index $y$ on detrended lagged values
of U.K. car production $X_1$, and of the Financial Times Commodity Index $X_2$ using
a model* which could be written (Box and Newbold (1971)) as

*For the present purpose we retain the model structure of Coen, Gomme and Kendall.
However its relevance seems dubious; for example, a multivariate time series
analysis by Tiao and Box (1980) for these three series shows the stock prices
acting as a weak leading indicator for the commodity index $X_2$.
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y  ■  +  6,t  +  0.x.  .  A  +  0.X_  _  +  e.  with  ■  ♦«  ,  +  a,.  (5.3) 

t  u  1  1  lit-b  2  2,t-7  t  t  t“l  t 

with $a_t$ white noise, and $\phi$ constrained to be equal to zero. Figure 3 illustrates an
analysis made by Pallesen (1977) in which $\phi$ is unconstrained. It shows the joint
posterior distribution for one of the regression coefficients and $\phi$, and the marginal
distribution for $\phi$, assuming locally flat priors for $\theta$, $\ln\sigma$ and $\phi$. Although for
this example serial correlation could have been easily detected by diagnostic checks,
notice the enormous shift (about five standard deviations) of the conditional
distribution of that coefficient given $\phi$ which occurs as $\phi$ changes from zero to
more plausible values. This illustrates the point that smaller serial correlation,
of a magnitude difficult to detect with diagnostic checks, could disastrously
invalidate estimates of $\theta$.

A second example discussed more fully in Bailey and Box (1980b) further illustrates
this approach for the "bad value" problem using the contaminated normal model
of Section 4.5. The data were used originally by Box and Behnken (1960) to illustrate
the analysis of a balanced incomplete four factor three-level design with n = 27
observations  arranged  in  three  blocks  of  nine.  A  residual  plot  suggests  the  possi¬ 
bility  of  two  bad  values  (y1Q  and  y^  ) .  However  the  small  number  of  residual 
degrees  of  freedom  and  the  nature  of  this  particular  design  would  induce  large  cor¬ 
relations  yielding  potentially  misleading  residual  patterns. 

Table 1 gives Bayesian means and standard deviations for coefficients in the
fitted model

$$ \hat y = \theta_0 + \sum_{i=1}^{4}\theta_i x_i + \sum_{i=1}^{4}\theta_{ii}x_i^2 + \sum_{i=1}^{3}\sum_{j>i}\theta_{ij}x_i x_j \qquad (5.4) $$

Fig. 3. Joint posterior distribution of one regression coefficient and $\phi$, and
marginal posterior distribution of $\phi$ (horizontal axis: autoregressive parameter $\phi$).
Note shift in approximate 95% interval as $\phi$ is changed.

In this analysis $k$ was set equal to 5 and the values of $\alpha$ varied over the
range 0 to 0.091. It has been shown by Chen and Box (1979) that for $k \ge 5$ the
posterior distribution is mainly a function of $\varepsilon = \alpha/\{(1 - \alpha)k\}$ so the results are
also labelled in terms of this dominant discrepancy parameter $\varepsilon$. It will be
noticed:

(a) The large change in each estimated effect and standard deviation occurs
when no possibility whatever of bad values ($\varepsilon = 0$) is replaced by a small
possibility ($\varepsilon = 0.001$). For good data the typical behavior of a table of this kind
is that only very minor changes in mean and standard deviation occur as $\varepsilon$
is changed over the plausible range.

(b) For all the estimates except $\beta_{14}$ the standard deviations of effects are
about halved. Thus for these effects the use of the more appropriate model is
equivalent to a four-fold increase in the size/sensitivity of the experiment, since a
standard deviation falls as the inverse square root of the number of observations. This
may be compared for example with a parallel analysis by Box and Cox of their biological
data where a three-fold increase in sensitivity resulted from the use of an
appropriate transformation.

(c) The analysis can be further illuminated by considering other available
quantities. In particular, plotting the probability that the $i$th value is bad,
given that one value is bad (see for example Abraham and Box (1978)), shows
94% of this probability associated with the tenth observation and the
remainder spread among the remaining 26 observations. It is likely therefore that
$y_{10}$ alone is a bad value. It is a deficiency of the design being used here that
least squares estimates of interactions employ only four of the 27 observations and
so lack robustness to bad observations (see for example Box and Draper (1975)). In
particular the least squares estimate of $\beta_{14}$ is a contrast of the form
$0.25\,(y_a - y_b - y_c + y_d)$ in four observations, one of which is $y_{10}$, so that the
Bayesian down-weighting of $y_{10}$ accounts for the large change in this estimate and
the increase in the standard deviation.

(d) We saw in the case of ridge regression how failure to take account of
observational information could lead to an unrealistic choice of the discrepancy
parameter $\gamma$. To complete the picture therefore, a plot of the marginal distribution
of the discrepancy parameter $\epsilon$ should be made in conjunction with Table 1
(compare also with the serial correlation example in Figure 3). For these data the
distribution $p_u(\epsilon \mid y)$ has its mode close to $\epsilon = 0.010$.

$\epsilon$         0          .001          .005          .010          .015          .020
$\alpha$           0          .005          .024          .048          .070          .091

$\beta_0$      90.60 (.94)   90.60 (.45)   90.60 (.41)   90.60 (.41)   90.60 (.41)   90.60 (.41)
$\beta_1$       1.93 (.47)    2.46 (.28)    2.49 (.23)    2.49 (.22)    2.49 (.22)    2.49 (.22)
$\beta_2$      -1.96 (.47)   -1.96 (.22)   -1.96 (.20)   -1.96 (.20)   -1.96 (.20)   -1.96 (.20)
$\beta_3$       1.13 (.47)    1.13 (.22)    1.13 (.20)    1.13 (.20)    1.13 (.20)    1.13 (.20)
$\beta_4$      -3.68 (.47)   -3.15 (.28)   -3.12 (.23)   -3.12 (.22)   -3.12 (.22)   -3.12 (.22)
$\beta_{11}$   -1.42 (.70)   -1.88 (.44)   -1.90 (.41)   -1.90 (.41)   -1.89 (.42)   -1.89 (.42)
$\beta_{22}$   -4.33 (.70)   -4.10 (.36)   -4.09 (.34)   -4.09 (.34)   -4.09 (.34)   -4.09 (.34)
$\beta_{33}$   -2.24 (.70)   -2.01 (.38)   -2.00 (.34)   -2.00 (.34)   -2.00 (.34)   -2.00 (.34)
$\beta_{44}$   -2.58 (.70)   -3.05 (.44)   -3.06 (.41)   -3.06 (.41)   -3.06 (.42)   -3.05 (.42)
$\beta_{12}$   -1.67 (.81)   -1.67 (.39)   -1.67 (.35)   -1.67 (.35)   -1.67 (.34)   -1.67 (.34)
$\beta_{13}$   -3.83 (.81)   -3.82 (.39)   -3.82 (.35)   -3.82 (.35)   -3.82 (.34)   -3.82 (.34)
$\beta_{14}$     .95 (.81)    -.45 (.95)    -.51 (.92)    -.50 (.93)    -.49 (.95)    -.48 (.95)
$\beta_{23}$   -1.67 (.81)   -1.67 (.39)   -1.67 (.35)   -1.67 (.35)   -1.67 (.35)   -1.67 (.35)
$\beta_{24}$   -2.62 (.81)   -2.62 (.39)   -2.62 (.35)   -2.62 (.35)   -2.62 (.35)   -2.62 (.35)
$\beta_{34}$   -4.25 (.81)   -4.25 (.39)   -4.25 (.35)   -4.25 (.34)   -4.25 (.34)   -4.25 (.34)

Table 1. Bayesian means and (standard deviations) for polynomial coefficients using the
contaminated model of Section 4.5 with $k = 5$ ($\epsilon = \alpha/\{(1 - \alpha)k\}$).

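
The two header rows of Table 1 are linked by the relation $\epsilon = \alpha/\{(1 - \alpha)k\}$ with $k = 5$. The following minimal Python sketch checks that the tabulated $\alpha$ values reproduce the tabulated $\epsilon$ labels to three decimals. The function and variable names are illustrative only and do not come from the original analysis.

```python
def discrepancy_epsilon(alpha: float, k: float) -> float:
    """Map the contamination probability alpha to epsilon = alpha / ((1 - alpha) * k)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if k <= 0:
        raise ValueError("k must be positive")
    return alpha / ((1.0 - alpha) * k)


# alpha values from the header of Table 1; k = 5 as stated in the text.
TABLE_ALPHAS = [0.0, 0.005, 0.024, 0.048, 0.070, 0.091]
K = 5.0

# Round to three decimals, the precision used for the epsilon labels.
table_epsilons = [round(discrepancy_epsilon(a, K), 3) for a in TABLE_ALPHAS]
print(table_epsilons)
```

Because the posterior is said to depend mainly on $\epsilon$ for $k \ge 5$, a helper of this kind makes it easy to relabel results obtained on an $\alpha$ grid.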
The Bayes approach to robust estimation has the advantage of generality; furthermore
it clearly reveals, at any given stage, on precisely what assumptions the analysis is
conditional. With the increased speed of computers and availability of visual display
equipment a general Bayesian computer program, that can analyze any model we
wish to entertain, seems a much more attractive prospect than the fresh devising of
semi ad hoc procedures for each new possibility.

Some parallels in the two approaches are briefly considered below for the
"bad value" problem.

5.2  Robust  estimation  for  the  "bad  value"  problem. 

For the "bad value" problem a wide variety of semi-empirical estimators have
been proposed. Among these are the M, L, and R estimators, and various kinds of adaptive
estimators. In turn, among the M estimators a number of different "$\psi$" functions
have been suggested, leading to different ways of downweighting extreme observations.

Now consider the model of Section 4.5 for the simple location structure
$E(y_i) = \mu$. Then (see for example Box and Tiao (1968b)) the Bayesian mean may be
written

$$\hat{\mu} = \sum_{b=0}^{n} p(b \mid y, \alpha)\, \bar{y}^{(b)} \qquad (5.5)$$

where $p(b \mid y, \alpha)$ is the posterior probability that there are $b$ bad values and
$\bar{y}^{(b)}$ is the corresponding conditional posterior mean. Consider in particular
$\bar{y}^{(1)} = \sum_j w_j y_j$. Chen and Box (1979) show that for $k \ge 5$

$$w_j = (n - 1)^{-1}(1 - D_j/D) \qquad (5.6)$$

n-1 

f  nr3  1~  2  ( 

"  U - ^  -  U  ♦  n’1^  2  (5.7) 

v  (n-1)  j  (  j 

where the two kinds of residual entering (5.7) are the unweighted and weighted residuals defined in Section 4.5.
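
The weights in (5.6) can be illustrated numerically. The sketch below takes the quantities $D_j$ as given positive inputs rather than computing them from residuals, and it assumes that $D$ is the sum of the $D_j$, which is the choice that makes the weights $w_j$ add to one. The function name and the example values of $D_j$ are hypothetical.

```python
from typing import List, Sequence


def one_bad_value_weights(d: Sequence[float]) -> List[float]:
    """Weights w_j = (1 - D_j / D) / (n - 1) of (5.6), assuming D = sum of the D_j."""
    n = len(d)
    if n < 2:
        raise ValueError("at least two observations are needed")
    if any(dj <= 0 for dj in d):
        raise ValueError("every D_j must be positive")
    total = sum(d)  # assumption: D is the sum of the individual D_j
    return [(1.0 - dj / total) / (n - 1) for dj in d]


# Ten observations with equal D_j: every weight is 1/n, so the mean is unchanged.
equal_weights = one_bad_value_weights([1.0] * 10)

# A hypothetical discrepant first observation (large D_1): its weight collapses
# and the remaining weight is spread evenly over the other nine observations.
outlier_weights = one_bad_value_weights([1000.0] + [1.0] * 9)
print(equal_weights[0], outlier_weights[0], outlier_weights[1])
```

With ten observations the undisturbed weight is 0.1, which is the starting height of the weighting curves discussed for Figure 4.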
Figure 4(a) and (b) show plots of $w = w_1$ against the unweighted and the weighted
residual of the first observation, respectively, for three random
samples of ten observations from a normal distribution when a multiple 0, 1, 2, ...
of $\sigma$ is added to the first observation in each sample. Empirical approximations
for these weighting curves are provided by the functions

$$w = 0.1 \exp\{-|0.49\, r|^{7}\} \quad \text{and} \quad w = 0.1 \exp\{-|0.3\, r|^{3.5}\},$$

where $r$ denotes the unweighted residual in the first function and the weighted
residual in the second. Also shown in Figure 4(b) for comparison is Tukey's biweight function

$$w = 0.1\{1 - (r/c)^2\}^2$$

for $c = 5.3$ (chosen to roughly match the curve). Although the
Bayesian weights are sample dependent they remain remarkably stable, as is indicated
(a) by the smooth manner in which the remaining weight is evenly spread throughout
the non-discrepant observations; (b) by the closeness with which points from different
samples follow the same curve.
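
The closeness of the second empirical curve to the biweight can be checked numerically. The sketch below evaluates the two empirical approximations and a biweight scaled to start at 0.1 with $c = 5.3$. Setting the biweight to zero beyond $|r| = c$ follows the usual definition of the biweight and is an assumption about how the comparison curve was drawn. Function names are illustrative.

```python
import math


def approx_weight_unweighted(r: float) -> float:
    """Empirical curve 0.1 * exp(-|0.49 r|^7), stated for the unweighted residual."""
    return 0.1 * math.exp(-abs(0.49 * r) ** 7)


def approx_weight_weighted(r: float) -> float:
    """Empirical curve 0.1 * exp(-|0.3 r|^3.5), stated for the weighted residual."""
    return 0.1 * math.exp(-abs(0.3 * r) ** 3.5)


def tukey_biweight(r: float, c: float = 5.3, scale: float = 0.1) -> float:
    """Biweight scale * (1 - (r/c)^2)^2 inside |r| < c and zero outside (assumed)."""
    u = r / c
    return scale * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0


# Largest gap between the weighted-residual curve and the biweight on a grid of r.
grid = [i / 10 for i in range(0, 61)]
max_gap = max(abs(approx_weight_weighted(r) - tukey_biweight(r)) for r in grid)
print(max_gap)
```

On this grid the two curves stay within a small fraction of the starting weight 0.1, in line with the remark that $c = 5.3$ was chosen to roughly match the curve.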

The estimate $\hat{\mu}$ is sample adaptive in another, more striking way, however. For
illustration consider the case where the $p(b \mid y, \alpha)$ are negligible for $b \ge 2$. Then,
writing $p = p(1 \mid y, \alpha)$, (5.5) becomes

$$\hat{\mu} = (1 - p)\,\bar{y} + p\,\bar{y}^{(1)} \qquad (5.8)$$

and the Bayesian mean is an interpolation between $\bar{y}$ and the "robustified" $\bar{y}^{(1)}$.

In this expression the value of $p$ is determined by the posterior odds ratio for
one versus no bad values

$$p/(1 - p) = \epsilon\,\{n/(n - 1)\}\,D, \qquad \epsilon = \alpha/\{(1 - \alpha)k\} \qquad (5.9)$$

and $D$ is the checking function encountered earlier.

Sample adaptivity is evidenced as follows. For a sample with no outliers $\bar{y}$
and $\bar{y}^{(1)}$ are not very different, so that $\hat{\mu}$ is close to $\bar{y}$. But in the presence
of an outlier of larger and larger size two things happen: the outlier is downweighted
in $\bar{y}^{(1)}$, which becomes more and more different from $\bar{y}$, and also $p$
becomes larger, placing more and more emphasis on $\bar{y}^{(1)}$.
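
The interpolation in (5.8) can be sketched in a few lines. The code below takes the posterior odds for one versus no bad values as an input, converts it to $p$, and forms the weighted combination of the ordinary mean and the robustified mean. The weights used to form the robustified mean, the data, and the odds values are all hypothetical and are chosen only to show the two regimes.

```python
from typing import Sequence, Tuple


def bayes_mean_one_bad(
    y: Sequence[float], weights: Sequence[float], odds: float
) -> Tuple[float, float]:
    """Return (p, (1 - p) * ybar + p * ybar1), where p / (1 - p) equals odds."""
    if len(y) != len(weights) or len(y) == 0:
        raise ValueError("y and weights must be non-empty and of equal length")
    if odds < 0:
        raise ValueError("posterior odds must be non-negative")
    ybar = sum(y) / len(y)                            # ordinary sample mean
    ybar1 = sum(w * v for w, v in zip(weights, y))    # robustified mean
    p = odds / (1.0 + odds)                           # invert odds = p / (1 - p)
    return p, (1.0 - p) * ybar + p * ybar1


# Hypothetical sample: nine zeros and one gross value, fully downweighted in ybar1.
y = [10.0] + [0.0] * 9
w = [0.0] + [1.0 / 9.0] * 9

p_low, mean_low = bayes_mean_one_bad(y, w, odds=0.0)     # no evidence of a bad value
p_high, mean_high = bayes_mean_one_bad(y, w, odds=99.0)  # strong evidence
print(p_low, mean_low, p_high, mean_high)
```

With zero odds the result is the ordinary mean, and with large odds it moves almost entirely to the robustified mean, which is the sample adaptivity described in the text.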

[Fig. 4. ... a normal distribution. Numbers 0, 1, 2, ... indicate that $0, \sigma, 2\sigma, \ldots$ has been added to $y_1$. Horizontal axis: $r$.]

The purpose of this discussion is to show that sensible solutions which appropriately
downweight suspected bad values may be obtained directly from an appropriate
model. From the viewpoint of the traditional M estimator, the weight function
($W$ say) for $\hat{\mu}$ itself is an interpolation $W = (1 - p)\frac{1}{n} + pw$ between $1/n$
and $w$. Thus $W$ will descend to the value $(1 - p)/n$ for large $r$. For a sample
containing a large outlier, $1 - p$ will be negligible and $W$ will approach the $w$
plotted in Figure 4(b) and will descend like Tukey's biweight. However, for a more
normal-looking sample $W$ will flatten out to some moderate non-zero value and will
more closely resemble the weighting originally proposed by Huber.
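
The two limiting behaviours of $W$ follow directly from the interpolation formula. The short sketch below evaluates $W = (1 - p)/n + pw$ for a fully downweighted observation ($w = 0$) under a large and a small value of $p$. The particular values of $p$ and $n$ are hypothetical.

```python
def overall_weight(p: float, w: float, n: int) -> float:
    """Overall weight W = (1 - p) / n + p * w given to one observation."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (1.0 - p) / n + p * w


n = 10
# Sample with a large outlier: p close to 1, so W nearly vanishes (biweight-like).
floor_outlier_sample = overall_weight(p=0.99, w=0.0, n=n)
# More normal-looking sample: p small, so W levels off well above zero.
floor_normal_sample = overall_weight(p=0.2, w=0.0, n=n)
print(floor_outlier_sample, floor_normal_sample)
```

An observation whose weight in the robustified mean equals $1/n$ keeps overall weight $1/n$ whatever the value of $p$, so the interpolation only matters for discrepant observations.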

In  choosing  robust  estimators  there  is  room  for  empiricism  but  I  think  that  some 
of  its  inspiration  should  be  applied  to  the  choice,  study,  and  consequences  of 
appropriate  parsimonious  models.  The  structure  of  the  resulting  Bayesian  analysis 
should  in  each  case  be  carefully  analyzed,  for  the  great  strength  of  such  a  model-based 
approach  is  that  the  exact  consequences  of  whatever  goes  into  the  model  must  come 
out.  These  consequences  will  either  agree  with  "common  sense"  or  they  will  not.  If 
they  do  not  then  we  know  either  that  what  went  in  was  inappropriate  in  a  way  we  had 
failed  to  foresee,  or  else,  as  happens  quite  frequently,  that  our  common  sense  was 
too  shortsighted.  In  either  case  we  learn  something. 